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Abstract: 

Optimally hybrid numerical solvers were constructed for massively parallel generalized eigenvalue problem (GEP). 
The strong scaling benchmark was carried out on the K computer and other supercomputers for electronic structure 
calculation problems in the matrix sizes of M = 10^-10^ with upto 10^ cores. The procedure of GEP is decom¬ 
posed into the two subprocedures of the reducer to the standard eigenvalue problem (SEP) and the solver of SEP A 
hybrid solver is constructed, when a routine is chosen for each subprocedure from the three parallel solver libraries 
of ScaLAPACK, ELPA and EigenExa. The hybrid solvers with the two newer libraries, ELPA and EigenExa, give 
better benchmark results than the conventional ScaLAPACK library. The detailed analysis on the results implies that 
the reducer can be a bottleneck in next-generation (exa-scale) supercomputers, which indicates the guidance for future 
research. The code was developed as a middleware and a mini-application and will appear online. 

Keywords: Massively parallel numerical library. Generalized eigenvalue problem. Electronic structure calculation, 
ELPA, EigenExa, The K computer, mini-application 


1. Introduction 

Numerical linear algebraic solvers for large matrices have 
strong needs among various applications with the current and 
next-generation supercomputers. Nowadays ScaLAPACK[l], [2] 

is the de facto standard solver library for parallel computations 
but several routines give severe bottlenecks in the computational 
speed with current massively parallel architectures. Novel solver 
libraries were proposed so as to overcome the bottlenecks. Since 
the performance of numerical routines varies significantly with 
problems and architectures, the best performance is achieved, 
when one constructs an optimal ‘hybrid’ among the libraries. 



Fig. 1 Concept of hybrid solver; Structure of the program code (a) without 
and (b) with hybrid solver or numerical middleware. 
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The concept of hybrid solver is illustrated in Fig. 1. It is a nu¬ 
merical middleware and has a unique data interface to real appli¬ 
cations. One can choose the optimal workflow for each problem 
without any programming effort. 

The present paper focuses on dense-matrix solvers for general¬ 
ized eigenvalue problems (GEPs) in the form of 

Ayk = hByk ( 1 ) 

with the given M X M real-symmetric matrices of A and B. 
The matrix B is positive definite. The eigenvalues [Ak] and the 
eigenvectors [yk) will be calculated. The computational cost is 
O(M^) or is proportional to M^. The present hybrid solvers are 
constructed among ScaLAPACK and the two newer libraries of 
ELPA [3], [4], [5] *2, and EigenExa [6], [7], [8], [9]. The ELPA 
and EigenExa libraries are written in Fortran and appeared in 
2000’s for efficient massively parallel computations. 

The present paper is organized as follows; Section 2 explains 
the background from the electronic structure calculation. Section 
3 describes the mathematical foundation. Sections 4 and 5 are 
devoted to the benchmark results and discussions, respectively. 
The summary and future outlook will appear in Sec. 6. 

2. Background 

2.1 Large-scale electronic structure calculations 

The GEP of Eq. (1) gives the mathematical foundation of elec¬ 
tronic structure calculations or quantum mechanical calculations 
of materials, in which an electron is treated as a quantum me- 

ELPA = Eigenvalue soLvers for Petascale Applications 
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Fig. 2 (a) The upper panel is a ;r-type electronic wavefunction in an 

amorphous-like conjugated polymer (poly-((9,9) dioctyl fluorine)). 
The lower panel shows the atomic structure (R^CgHn) [10]. (b) 
Strong scaling plot by ELSES for one-hundred-million-atoms calcu¬ 
lations on the K computer. [10], [11] The calculated materials are a 
nano-composite carbon solid (the upper line) and the amorphous-like 
conjugated polymer (the lower line). The number of used processor 
nodes are from P = 4,096 to 82,944 (full nodes of the K computer). 

chanical ‘wave’. The input matrix A or 5 of Eq. (1) is called 
Hamiltonian or the overlap matrices, respectively. An eigenvalue 
of {Ak} is the energy of one electron and an eigenvector of {y^} 
specifies the wavefunction or the shape of an electronic ‘wave’. 
Fig. 2(a) shows an example of the wavefunction. The number of 
the required eigenvalues is, at least, on the order of the number of 
the electrons or the atoms in calculated materials. See the ELPA 
paper [3] for a review, because ELPA was developed under tight 
collaboration with electronic structure calculation society. 

Here, our motivation is explained. The present authors devel¬ 
oped a large-scale quantum material simulator called ELSES 
[12], [13]. The theories are explained in Refs. [13], [14] and 
the reference therein. The matrices are based on the real-space 
atomic-orbital representation and the matrix size M is nearly pro¬ 
portional to the number of atoms N (M oc N). The simulations 
mainly use novel ‘order-A’ linear-algebraic methods in which the 
computational cost is ‘order-A’ ((9(A)) or is proportional to the 
number of atoms A. Their mathematical foundation is sparse- 
matrix (Krylov-subspace) solvers. Efficient massively parallel 
computation is found in Fig. 2, a strong scaling benchmark on 
the K computer [10], [11] with one hundred million atoms or 
one-hundred-nanometer scale materials. The simulated materi¬ 
als are a nano-composite carbon solid with A = 103,219,200 or 
M = 412, 876, 800 [10] and an amorphous-like conjugated poly¬ 
mer with A = 102,238, 848 or M = 230,776,128 [11]. 

The present dense-matrix solvers are complementary methods 
to the order-A calculations, because the order-A calculation gives 
approximate solutions, while the dense-matrix solvers give nu¬ 
merically exact ones with a heavier ((9(M^)) computational cost. 
The use of the two methods will lead us to fruitful researches. The 
exact solutions are important, for example, when the system has 
many nearly degenerated eigen pairs and one would like to distin¬ 
guish them. The exact solutions are important also as reference 
data for the development of fine approximate solvers. 

The matrices of A and B in the present benchmark appear on 
‘ELSES Matrix Library’. [15] The Library is the collection of the 
matrix data generated by ELSES for material simulations. The 
benchmark was carried out with the data files of ‘NCCS430080’, 
‘VCNT22500’ ‘VCNT90000’ and ‘VCNT1008000’ for the ma¬ 
trix sizes of M=22,500, M=90,000, M=430,080, M= 1,008,000, 
respectively. A large matrix data (> 0.5GB) is uploaded as a set 

ELSES = Extra-Large-Scale Electronic Structure calculation 


of Split files for user’s convenience. 

The physical origin of the matrices is explained briefly. The 
files in the present benchmark are carbon materials within mod¬ 
eled tight-binding-form theories based on ab initio calculations. 
The matrix of ‘NCCS430080’ appears in our material research on 
a nano-composite carbon solid (NCCS) [16]. An sp-orbital form 
[17] is used and the system contains A = M/4 = 107,520 atoms. 
The other files are generated for thermally vibrated single-wall 
carbon nanotubes (VCNTs) within a supercell. An spd-orbital 
form [18] is used and each system contains A = M/9 atoms. The 
VCNT systems were prepared, so as to generate matrices system¬ 
atically in different size with similar eigenvalue distributions. We 
used these matrices for the investigation on ;r-electron materials 
with the present dense-matrix solver and the order-A solver. 



Fig. 3 Workflow of the hybrid GEP solver. 


3. The hybrid solvers 

A hybrid solver is constructed, when a routine is chosen for 
each subprocedure from ScaLAPACK, EigenExa and ELPA. The 
code was developed as a general middleware that can be con¬ 
nected not only to ELSES but also to any real application soft¬ 
ware, as in Fig. 1. A mini-application was also developed and 
used in the present benchmark. In the benchmark, ScaLAPACK 
was used as a built-in library on each machine. EigenExa in the 
version 2.2a and ELPA in the version 2014.06.001 were used. 
ELPA and EigenExa call some ScaLAPACK routines. 

3.1 Mathematical formulation 

The GEP of Eq. (1) can be written in a matrix form of 

AY = BY A, (2) 

where the matrix A = diag(/li, /I 2 ,...) is diagonal and the matrix 
y = (yiUi ■) satisfies Y^BY = I. In the solvers, the GEP of 
Eq. (1) is reduced to a standard eigenvalue problem (SEP) of 

The present matrices are sparse, which does not lose the generality of the 
benchmark, since the cost of the dense matrix solver is not dependent on 
the number of non-zero elements of the matrix. 

The present EigenExa package does not include the GEP solver. The 
GEP solver routine for EigenExa in the present paper is that of the ver¬ 
sion 2.2b of KMATH_EIGEN_GEV [19] that shares the SEP solver rou¬ 
tine with the EigenExa package. 
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A'Z = ZA, (3) 

where the reduced matrix A' is real symmetric [20] and the ma¬ 
trix ofZ = {ziZ 2 • • •) contain eigenvectors of A'. The reduction 
procedure can be achieved, when the Cholesky factorization of B 
gives the Cholesky factor U as an upper triangle matrix: 

B=U'^U. (4) 

The reduced matrix A' is defined by 

A' = U~'^AU~\ (5) 

The eigenvectors of the GEP, written as T = {y\y 2 • • •). are cal¬ 
culated from those of the SEP by 

Y = [/“'Z. (6) 

This procedure is usually called backward transformation. 

The GEP solver is decomposed into the two subprocedures of 
(a) the solver of the SEP in Eq. (3) and (b) the reduction from the 
GEP to the SEP ((A, B) => A') and the backward transformation 
(Z => T). The subprocedures (a) and (b) are called ‘SEP solver’ 
and ‘reducer’, respectively, and require O(M^) operations. 

Eigure 3 summarizes the workflows of the possible hybrid 
solvers. A hybrid solver is constructed, when one choose the rou¬ 
tines for (a) the SEP solver and (b) the reducer, respectively. 

Eor (a) the SEP solver, five routines are found in the base li¬ 
braries; One routine is a ScaLAPACK routine (routine name in 
the code : ‘pdsyevd’) that uses the conventional tridiagonaliza- 
tion algorithm. [21] The ELPA or EigenExa library contains 
a SEP solver routine based on the tridiagonalization algorithm. 
The routine in ELPA is called ‘ELPAl’ (routine name in the 
code : ‘solve_evp-real’) in this paper, as in the original paper 
[3], and the one in EigenExa called ‘Eigen.s’ or ‘EIGS’ (rou¬ 
tine name in the code : ‘eigen.s’). ELPA and EigenExa also 
contain the novel SEP solvers based on the narrow-band reduc¬ 
tion algorithms without the conventional tridiagonalization pro¬ 
cedure. The solvers are called ‘ELPA2’ (routine name in the code 
: ‘solve_evp_real_2stage’) for the ELPA routine and ‘Eigen.sx’ or 
‘EIGX’ (routine name in the code : ‘eigen.sx’) for the EigenExa 
routine in this paper. See the papers [4], [8] for details. 

Eor (b) the reducer, three routines are found in the base libraries 
and are called ScaLAPACK style, ELPA style, and EigenExa 
style reducers in this paper. In the ScaLAPACK style, the 
Cholesky factorization, Eq. (4) is carried out and then the reduced 
matrix A', defined in Eq. (5), is generated by a recursive algorithm 
(routine name ‘pdsygst’) without explicit calculation of U~^ nor 
U ~^. Details of the recursive algorithm are explained, for exam¬ 
ple in Ref. [22]. In the ELPA style, the Cholesky factorization 
(routine name: ‘cholesky.real’) is carried out, as in the ScaLA¬ 
PACK style, and the reduced matrix A' is generated by the explicit 
calculation of the inverse (triangular) matrix R = U~^ (routine 
names : ‘invert_trm_rear) and the explicit successive matrix mul¬ 
tiplication of A' = (R^A)R (routine names: ‘mult_at_b_rear) [3] 
In the EigenExa style, the Cholesky factorization is not used. 

The benchmark was carried out in an ELPA style reduction algorithm. 
The ScaLAPACK routine of ‘pdtrmm’ is used for the multiplication of 
the triangular matrix R from right, while a sample code in the ELPA 
package uses the ELPA routine (‘mult_at_b_rear). We ignore the differ¬ 
ence, since the elapse time of the above procedure is not dominant. 


Instead, the SEP for the matrix B 

BW = WD, (7) 

is solved by the SEP solver (Eigen.sx), with the diagonal matrix 
of D = diag(Ji, ^ 2 . •••) and the unitary matrix of W = (lUi 1U2 ....)• 
A reduced SEP in the form of Eq. (3) is obtained by 

A' = (8) 

Y = WD~^I^Z, (9) 

because of Z = D^^^W^Y and W~^ = W. Equation (9) is solved 
by the SEP solver (Eigen_sx). 

Though the SEP solver of Eq. (3) requires a larger operation 
cost than the Cholesky factorization (See Eig.l of Ref. [23], for 
example), the elapse time can not be estimated only from the op¬ 
eration costs among the modern supercomputers. 

Table 1 List of the workflows in the benchmark. The routine names for the 
SEP solver and the reducer are shown for each workflow. Abbrevi¬ 
ations are shown within parentheses. 


Workflow 

SEP solver 

Reducer 

A 

ScaLAPACK (SCLA) 

ScaLAPACK (SCLA) 

B 

Eigen_sx (EIGX) 

ScaLAPACK (SCLA) 

C 

ScaLAPACK (SCLA) 

ELPA 

D 

ELPA2 

ELPA 

E 

ELPAl 

ELPA 

F 

Eigen_s (EIGS) 

ELPA 

G 

Eigen_sx (EIGX) 

ELPA 

H 

Eigen_sx (EIGX) 

Eigen_sx (EIGX) 


The benchmark of the hybrid GEP solvers was carried out for 
the eight workflows listed in Table 1. In general, a potential issue 
is the possible overhead of the data conversion process between 
libraries. This issue will be discussed in Sec. 5.2. 

4. Benchmark result 

Strong scaling benchmarks are investigated for the hybrid 
solvers. The elapse times were measured for (i) the full eigenpair 
calculation (Tfun) and (ii) the ‘eigenvalue-only’ calculation (Tevo)- 
In the latter case, the elapse time is ignored for the calculation 
of the eigenvectors. The two types of calculations are important 
among electronic structure calculations. [3] The present bench¬ 
mark ignores small elapse times of the initial procedure for dis¬ 
tributed data and the comments on them will appear in Sec. 5.1. 

The benchmark was carried out on three supercomputers; the K 
computer at Riken, Eujitsu EX 10 and SGI Altix ICE 8400EX. The 
K computer has a single SPARC 64 Vlllfx processor (2.0GHz, 
8-core) on node. The EX 10 is Oakleaf-EX of the University of 
Tokyo. Eujitsu EX 10 is the successor of the K computer and has 
a single SPARC64 IXfx processor (1.848 GHz, 16-core) on each 
node. We also used SGI Altix ICE 8400EX of Institute for 
Solid State Physics of the University of Tokyo. It is a cluster of 
Intel Xeon X5570 (2.93GHz, 8-core). The byte-per-flop value 
(B/E) is B/E=0.5, 0.36 or 0.68, for the K computer, EXIO or SGI 
Altix, respectively. The numbers of used processor nodes P are 
set to be square numbers (P = q^) except in Sec. 4.3, since the 

Additional options of the K computer and LX 10 are explained; We did 
not specify a MPI process shape on the Tofu interconnect. We used the 
rank directory feature to alleviate I/O contention. 
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Table 2 Selected results of the benchmark. The elapse time for the full 
(eigenpair) calculation (Tfuii) and that for the eigenvalue-only cal¬ 
culation (Tevo) with the workflows. The recorded time is the best 
data among ones with different numbers of the used nodes. The 
number of used nodes (P) for the best data is shown within paren¬ 
theses. The best data among the workflows are labelled by ‘[B]’. 
The saturated data are labelled by ‘[S]’. The workflow D' on Al- 
tix is that without the SSE optimized routine of the ‘ELPA2’ SEP 
solver. See the text for details. 


Size M/Machine WE 

Tfuii (sec) 

^evo (sec) 

1,000,080/EX10 G 

39,919 (P = 4,800) 

35,103 (P = 4,800) 

430,080/K A 

11,634 {P= 10,000) 

10,755 (P = 10,000) 

B 

8,953 (P = 10,000) 

8,465 (P = 10,000) 

C 

5,415 (P = 10,000) 

4,657 (P = 10,000) 

D 

4,242 (P = 10,000) 

2,227 (P = 10,000)[B] 

E 

2,990 (P = 10,000) 

2,457 (P = 10,000) 

F 

2,809 (P = 10,000) 

2,416 (P = 10,000) 

G 

2,734 (P = 10,000)[B] 

2,355 (P = 10,000) 

H 

3,595 (P = 10,000) 

3,147 (P = 10,000) 

90,000/K A 

590 (P = 4,096) 

551 (P = 4,096) 

B 

493 (P = 1,024)[S] 

449 (P = 1,024)[S] 

C 

318 (P = 4,096) 

298 (P = 4,096) 

D 

259 (P = 4,096) 

190 (P = 4,096)[B] 

E 

229 (P = 4,096)[B] 

194 (P = 4,096) 

E 

233 (P = 4,096) 

210 (P = 4,096) 

G 

258 (P = 4,096) 

240 (P = 4,096) 

H 

253 (P=4,096) 

236 (P=4,096) 

90,000/EX10 A 

1,248 {P= 1,369) 

1,183 {P= 1,369) 

B 

691 (P= 1,024)[S] 

648 (P = 1,024)[S] 

C 

835 (P= 1,369) 

779 (P = 1,369) 

D 

339 (P= 1,369) 

166 (P = 1,024)[B][S] 

E 

262 (P = 1,369) 

233 {P= 1,024)[S] 

F 

250 (P = 1,369)[B] 

222 (P = 1,369) 

G 

314 {P= 1,024)[S] 

283 {P= 1,024)[S] 

H 

484 (P= 1,369) 

456 (P= 1,369) 

90,000/Altix A 

1,985 (P = 256) 

1,675 (P = 256) 

B 

1,883 (P = 256) 

1,586 (P = 256) 

C 

1,538 (P = 256) 

1,240 (P = 256) 

D 

1,621 (P = 256) 

594 (P = 256) 

D' 

2,621 (P = 256) 

585 (P = 256)[B] 

E 

1,558 (P = 256) 

1,287 (P = 256) 

F 

1,670 (P = 256) 

1,392 (P = 256) 

G 

1,453 (P = 256)[B] 

1,170 (P = 256) 

H 

2,612 (P=256) 

2,261 (P=256) 

22,500/K A 

65.2 (P = 1,024) 

59.6 (P = 256) 

B 

45.8 {P= 1,024)[S] 

43.2 {P= 1,024)[S] 

C 

41.7 (P = 2,025) 

37.8 (P = 2,025) 

D 

28.4 (P = 2,025) 

22.6 (P = 1,024) 

E 

28.3 (P = 2,025)[B] 

22.6 (P = 1,024)[B] 

F 

28.8 {P= 1,024)[S] 

26.9 (P = 1,024)[S] 

G 

29.7 {P= 1,024)[S] 

27.8 {P= 1,024)[S] 

H 

39.3(P=1024)[S] 

37.5(P=1024)[S] 

22,500/EX10 A 

126.2 (P = 256) 

118.1 (P = 256) 

B 

71.3 (P = 256)[S] 

67.1 (P = 256)[S] 

C 

103.5 (P = 256)[S] 

96.3 (P = 256)[S] 

D 

30.5 (P = 529)[B] 

24.4 (P = 529)[B] 

E 

34.3 (P = 256)[S] 

31.2 (P = 256)[S] 

F 

32.1 (P = 529) 

29.4 (P = 529) 

G 

45.3 (P = 529) 

42.5 (P = 529) 

H 

74.9(P=529) 

72.2 (P=529) 

22,500/Altix A 

51.4 (P = 256) 

42.1 (P = 256) 

B 

70.0 (P = 256) 

50.7 (P = 256) 

C 

45.6 (P = 256) 

35.5 (P = 256) 

D 

41.8 (P = 256) 

22.3 (P = 256)[B] 

D' 

59.6 (P = 256) 

21.8 (P = 256)[B] 

E 

32.3 (P = 256)[B] 

26.7 (P = 256) 

E 

48.5 (P = 256) 

37.3 (P = 256) 

G 

57.2 (P = 256) 

39.6 (P = 256) 

H 

71.2 (P=256) 

64.1 (P=256) 


ELPA paper [3] reported that the choice of a (near-)square num¬ 
ber for P can give better performance. 

When the non-traditional SEP solver algorithm of ELPA is 
used on Altix, one can choose an optimized low-level routine 


using SSE instructions (‘REAL_ELPA_KERNEL_SSE’) and a 
generic routine (‘REAL_ELPA_KERNEL_GENERIC’). [3] The 
optimized code can run only on the Intel-based architectures com¬ 
patible to SSE instructions and was prepared so as to accelerate 
the backtransformation subroutine. Among the results on Altix, 
the ‘ELPA2’ solver and the workflow D on Altix are those with 
the optimized routine, while the ‘ELPA2'’ solver and the work- 
flow D' are those with the generic routine. 


(a) M = 430,080 


(b) M = 430,080 
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(c) M = 430,080 
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Fig. 4 Results with M=430,800 on the K computer. The elapse times are 
plotted with the workflows for the (a) full (Tfuii) and (b) eigenvalue- 
only (Tevo) calculations, (c) The decomposed times for the SEP 
solver (Tsep) and for the reducer (Tred) are plotted. The routines 
for the reducers is labeled by ‘(RED)’. Detailed decomposed times 
for subprocedures of the ELPA style reducer and the Cholesky de¬ 
composition in the ScaLAPACK style reducer are also plotted in (c). 
The ideal speedup in parallelism is drawn as a dashed gray line. 


4.1 Result with the matrix size of M = 430,080 

The benchmark with the matrix size of M = 430,080 was car¬ 
ried out for up to P = 10,000 nodes on the K computer. The elapse 
times for P= 10,000 nodes is shown in Table 2. The elapse time 
for all the cases are shown in Pig. 4 for the (a) full (Tfun) or (b) 
eigenvalue-only (Tevo) calculations. The decomposed times are 
also shown in Pig. 4 (c) for the SEP solver (Tsep) and the reducer 
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(T'red) (T'fuii = Tsep + T'red)* 

Table 3 Decomposition of the elapse time (sec) of the SEP solvers with 
M=430,080 and P = 10,000. See the text for the subroutine names 
of ‘TRD/BAND’ , ‘D&C’ and ‘BACK’. 


SEP solver 

TRD/BAND 

D&C 

BACK 

Total (Tsep) 

SCLA 

3,055 

465 

633 

4,152 

ELPA2 

966 

141 

1,892 

2,999 

ELPAl 

1,129 

138 

400 

1,667 

EIGS 

1,058 

196 

265 

1,521 

EIGX 

828 

390 

255 

1,473 


Table 3 shows the decomposed time of the SEP solvers for 
P= 10,000. A SEP solver routine is decomposed into three sub¬ 
routines of (i) the tridiagonalization or narrow-band reduction 
(‘TRD/BAND’), (ii) the divide and conquer algorithms for the 
tridiagonal or narrow-band matrices (‘D&C’) so as to compute 
the eigenvalues, and (iii) the backtransformation of eigenvectors 
(‘BACK’) so as to compute the eigenvectors of the GEP 

One can observe several features on the results; (I) In the 
full calculation benchmark (Pig. 4(a)), the best data, the smallest 
elapse time, appears in the workflow G for P= 10,000. The work- 
flow G is the hybrid solver that uses the ‘Eigen_sx’ SEP solver 
in EigenExa and the ELPA style reducer, since these routines are 
the best among the SEP solvers and the reducers, respectively, as 
shown in Pig. 4(c) and Table 3. In Table 2, the speed (T^“}i) of 
the workflow G is approximately four times faster than that of 
the conventional workflow A (11,634 sec) / (2,734 sec) ^ 4.3). 
(II) Pig. 4 (c) shows that the ELPA style reducer gives signifi¬ 
cantly smaller elapse times than those of ScaLAPACK and those 
of EigenExa. The elapse time for P= 10,000 is T^ed = 1,261 sec 
with the ELPA style reducer and is T^ed = 2,157 sec with the 
EigenExa reducer. The elapse time with the EigenExa reducer is 
governed by that of the SEP solver for Eq. (7) (Tsep = 1,473 sec in 
Table 3). (Ill) In the eigenvalue-only calculation (Pig. 4(b)), the 
best data, the smallest elapse time, appears in the workflow D for 
P= 10,000. The workflow D is the solver that uses the ‘ELPA2’ 
SEP solver and the ELPA style reducer and the eigenvector calcu¬ 
lation consumes a large elapse time of Tyec; Tyec = Tmi - Tevo = 
(4,242 sec) - (2,227 sec) = (2,015 sec) in Table 2. The time Tyec 
is contributed mainly by the backward transformation subroutine 
(Track =1,892 sec) in Table 3, because the backward transforma¬ 
tion subroutine in ELPA2 uses a characteristic two-step algorithm 
(See Sec. 4.3 of Ref. [3]). 

4.2 Benchmark with the matrix sizes of M=90,000, 22,500 

The benchmark with the smaller matrix sizes of M = 90,000 
and 22,500 are also investigated. The maximum number of used 
processor nodes is Pmax = 4,096, 1,039 and 256, on the K com¬ 
puter, EX 10, and Altix, respectively. Pigures 5 and 6 show the 
data with M=90,000 and with M=22,500, respectively. The de¬ 
composed times are shown in Pig. 7. Table 2 shows the best data 
for each workflow among the different numbers of used nodes. 

We observed on Altix that the ‘ELPA2’ and ‘ELPA2'’ SEP solver re¬ 
quired non-blocking communication requests beyond the default limit 
number of A^mpi_max = 16, 384 and the job stopped with an MPI er¬ 
ror message. Then we increased the limit number to A^mpijviax = 
1,048,576, the possible maximum of the machine by the environment 
variable ‘MPI_REQUESTjyiAX’ and the calculations were completed. 


The results will help general simulation researchers to choose the 
solver and the number of used nodes, since the elapse times in 
Table 2 are less than a half hour and such calculations are popular 
‘regular class’ jobs among systematic investigations. 

Here, the results are discussed; (I) Table 2 shows that the small¬ 
est elapse time in the full calculation appears among the work- 
flows with the ELPA style reducer (the workflows D, E, F, and 
G) and that in the eigenvalue-only calculation appears with the 
workflow D. The above features are consistent to the results in 
the previous subsection. (II) Unlike the result in the previous 
subsection, the speed up is sometimes saturated. An example is 
observed in Pig. 6 (a), in the full calculation with M=22,500 on 
the K computer, because the elapse time in the workflow F gives 
a minimum as the function of P at P= 1,024. The decomposition 
analysis of Pig. 7(b) indicates that the saturation occur both for 
the SEP solver and the reducer, which implies that the improve¬ 
ment both on the SEP solver and the reducer is desirable. The sat¬ 
urated cases are marked in Table 2 with the label of ‘[S]’. (III) 
Pinally, the SSE-optimized routine in the workflow D is com¬ 
pared with the generic routine in the workflow D' in the case of 
M = 90,000 on Altix with P =256. The SSE-optimized routine 
is prepared only in the backward transformation process. Since 
the process with the SSE-optimized routine or the generic one 
gives the elapse time of Tback = 929 sec or Tback = 1,872 sec, 
respectively, the process is accelerated with the SSE-optimized 
routine by 1,872 sec / 929 sec ^ 2.02. As shown in Table 2, the 
full calculation is accelerated with the SSE-optimized routine by 
2,621 sec / 1,621 sec ^ 1.62. 

4.3 Benchmark for a million dimensional matrix 

Pinally, the benchmark for a million dimensional matrix is 
discussed. A press release at 2013 [24] reported, as a world 
record, a benchmark of a million dimensional SEP carried out by 
EigenExa, in approximately one hour, on the full (82,944) nodes 
of the K computer. An eigenvalue problem with a million dimen¬ 
sional matrix (M=10^) seems to be the practical limitation of the 
present supercomputer, owing to the O(M^) operation cost. 

We calculated a million dimensional GEP at Dec. 2014 on the 
full (4,800) nodes of Oakleaf-PX. Since our computational 
resource was limited, only one calculation was carried out with 
the workflow G, because it gives the best data among those with 
M = 430,080 in Table 2. The calculation finished in approxi¬ 
mately a half day, as shown in Table 2 (Tfun = 39,919 sec and Tevo 
= 35,103 sec). The elapse time of the reducer (Fred = Tmi - Tsep 
= 15,179 sec) is smaller than but comparable to that of the SEP 
solver (Tsep = 24,740). The benchmark proved that the present 
code qualifies as a software applicable to massively parallel com¬ 
putation with up to a million dimensional matrix. 


One should remember that supercomputers are usually shared by many 
researchers who run many calculations in similar problem sizes succes¬ 
sively and/or simultaneously. 

No saturation is found on Altix, unlike on the K computer and EX 10, 
partially because the maximum number of used nodes (Pmax = 256) is 
smaller. 

We used EX 10 not the K computer, because EX 10 is in a newer archi¬ 
tecture with a lower B/E value and the result on EX 10 is speculated to be 
closer to that on the next-generation (exa-scale) machine. 
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Fig. 5 Benchmark with the matrix size of M=90,000, (I) on the K computer 
for the (a) full (eigenpair) and (b) eigenvalue-only calculation, (II) 
on FXIO for the (c) full and (d) eigenvalue-only calculation, (III) on 
Altix for the (e) full and (f) eigenvalue-only calculation. The ideal 
speedup in parallelism is drawn as a dashed gray line. 


Fig. 6 Benchmark with the matrix size of M=22,500, (I) on the K computer 
for the (a) full (eigenpair) and (b) eigenvalue-only calculation, (II) 
on FXIO for the (c) full and (d) eigenvalue-only calculation, (III) on 
Altix for the (e) full and (f) eigenvalue-only calculation. The ideal 
speedup in parallelism is drawn as a dashed gray line. 


5. Discussions 

5.1 Preparation of initial distributed data 

In the benchmark, the initial procedures including file read¬ 


ing are carried out for the preparation of distributed data. Its 
elapse time is always small and is ignored in the previous sec¬ 
tion. These procedures, however, may consume significant 

In the case of the workflow G on the K computer with M=430,080 
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(a) M = 90,000 (b) M = 22,500 



(c) M = 90,000 (d) M = 22,500 



(e) M = 90,000 (f) M = 22,500 



Fig. 7 Decomposition analysis of the elapse time into those of the SEP 
solver and the reducer (I) on the K computer with (a) M=90,000 and 
(b) M=22,500, (II) on FXIO with (c) M=90,000 and (d) M=22,500, 
(III) on Altix with (c) M=90,000 and (d) M=22,500. The routines for 
the reducers is labeled by ‘(RED)’. The ‘ELPA2'’ SEP solver is that 
without the SSE optimized routine. The ideal speedup in parallelism 
is drawn as a dashed gray line. 


and P= 10,000, for example, the elapse time of the initial procedures 


elapse times, when the present solver is used as a middleware 
with real applications. The discussions on such cases are beyond 
the present scope, since they depend on the program structure 
of the real applications. Here, several comments are added for 
real application developers; In general, the matrix data cost is, at 
most, O(M^) and the operation cost is O(M^) in the dense-matrix 
solvers and one should consider a balance between them. In the 
case of M = 430,080, for example, the required memory size for 
all the matrix elements is 8 B x ^1.5 TB, which can not be 
stored on a node of the K computer. Therefore, the data should 
be always distributed. In our real application (ELSES), the initial 
distributed data is prepared, when only the required elements are 
generated and stored on each node. 

5.2 Data conversion overhead 

As explained in Sec. 3.1, several workflows require data con¬ 
version processes between distributed data formats, since ScaLA- 
PACK and ELPA use block cyclic distribution with a given block 
size nbiock(> 1) and EigenExa uses cyclic distribution (^biock = !)• 
In the present benchmark, the block size /ibiock in ScaLAPACK 
and ELPA was set to be ^biock = 128, a typical value. Conse¬ 
quently, the workflows B, E, G require data conversion processes. 
In the present paper, the elapse time of the conversion procedures 
is included in the reducer part (Tj-ed)- 

Table 4 shows the elapse time for the data conversion. The 
elapse times are shown in the cases with the maximum num¬ 
bers of used nodes (P = Pmax) among the present benchmark. 
Two data conversion procedures are required. One is the conver¬ 
sion from the block cyclic distribution into the cyclic distribution, 
shown as ‘(b ^ 1)’ in Table 4 and the other is the inverse process 
shown as ‘(1 b)’* The two procedures are carried out, com¬ 

monly, by the ‘pdgemr2d’ routine in ScaLAPACK. 

Table 4 indicates that the overhead of the data conversion pro¬ 
cedures is always small and is not the origin of the saturation. In 
general, the conversion requires an O(M^) operation cost, while 
the calculation in a dense-matrix solver requires an O(M^) opera¬ 
tion cost. The fact implies the general efficiency of hybrid solvers, 
at least, among dense-matrix solvers. 

Table 4 The elapse times for data conversion; ‘(b ^ 1)’, ‘(1 ^ b)’ and 
‘Tred’ are the times in seconds for, the conversion process from 
block cyclic into cyclic distributions, the inverse process and the 
whole reducer procedure, respectively. The saturated data of Tred 
are labelled by ‘[S]’. The ‘ratio’ is ((b ^ 1) + (1 ^ b)) / Tred- 


Size M 

Machine(P) 

(b^l) 

(l^b) 

Tred 

ratio[%] 

1,008,000 

FX10(4,800) 

51.4 

51.7 

8,208 

1.26 

430,080 

K(10,000) 

13.4 

6.48 

1,261 

1.58 

90,000 

K(4,096) 

6.89 

0.797 

124[S] 

6.21 


FX10(1,369) 

1.89 

0.973 

84.0[S] 

3.41 


Altix(256) 

2.01 

2.02 

394 

1.02 

22,500 

K(2,025) 

0.571 

0.610 

11.3[S] 

10.4 


FX10(529) 

0.328 

0.176 

9.20 

5.48 


Altix(256) 

0.120 

0.279 

11.9 

3.35 


is Tini =123 sec and is much smaller than that of the total computation 
(Tot = 2,734sec. See Table. 2). It is noteworthy that the present matri¬ 
ces are sparse, as explained in Sec. 2. 
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5.3 Decomposition analysis of the reducer 

The decomposition analysis of the ELPA-style reducer is fo¬ 
cused on, since the ELPA-style reducer is fastest among the three 
libraries. Eigure 4 (c) shows the case on the K computer with 
M=430,080. The elapse times of the subprocedures of the ELPA- 
style reducer are plotted; ‘ELPA(i?i)’ is the Cholesky factoriza¬ 
tion of Eq. (4), ‘ELPA(i? 2 )’ is the explicit calculation of the in¬ 
version R = U~^ of the Cholesky factor U, ‘ELPA(i? 3 )’ and 
‘ELPA(i? 4 )’ are the successive matrix multiplication of Eq. (5) 
and ‘ELPA(i? 5 )’ is the backward transformation of eigenvectors 
by matrix multiplication of Eq. (6). The elapse times of the 
Cholesky factorization in the ScaLAPACK style reducer is also 
plotted as ‘SCLA(i?i)’ as a reference data. The same decom¬ 
position analysis is carried out also for other cases, as shown in 
Pig. 8. One can observe that the Cholesky factorization of the 
ELPA-style reducer does not scale and sometimes is slower than 
that of the ScaLAPACK reducer. In particular, the saturation of 
the ELPA-style reducer is caused by that of the Cholesky factor¬ 
ization in Pig. 8 (a)(b)(c). 

The above observation implies that the reducer can be a seri¬ 
ous bottleneck in the next-generation (exa-scale) supercomputers, 
though not in the present benchmark. One possible strategy is the 
improvement on the Cholesky factorization for better scalability 
and another is the development of a reducer without the Cholesky 
factorization, as in the EigenExa-style reducer. 

6. Summary and future outlook 

In summary, hybrid GEP solvers were constructed between 
the three parallel dense-matrix solver libraries of ScaLAPACK, 
ELPA and EigenExa. The benchmark was carried out with up 
to a million dimensional matrix on the K computer and other 
supercomputers. The hybrid solvers with ELPA and EigenExa 
give better benchmark results than the conventional ScaLAPACK 
library. The code was developed as a middleware and a mini¬ 
application and will appear online. Several issues are discussed. 
In particular, the decomposition analysis of the elapse time re¬ 
veals a potential bottleneck part on next-generation (exa-scale) 
supercomputers, which indicates the guidance for future devel¬ 
opment of the algorithms and the codes. 

As a future outlook, the present code for the hybrid solvers 
is planned to be extended by introducing the solvers with differ¬ 
ent mathematical foundations. A candidate is the parallel block 
Jacobi solver [25], [26]. Since the solver is applicable only to 
standard eigenvalue problems, the hybrid solver enables us to use 
the solver in generalized eigenvalue problems. 
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Fig. 8 Decomposition analysis of the elapse time of subprocedures of the 
ELPA style reducer and the Cholesky factorization in the ScaLA¬ 
PACK style reducer (I) on the K computer with (a) M=90,000 and (b) 
M=22,500, (II) on LXIO with (c) M=90,000 and (d) M=22,500, (III) 
on Altix with (c) M=90,000 and (d) M=22,500. The ideal speedup 
in parallelism is drawn as a dashed gray line. 
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